Speech Is at Least 4-Dimensional: Receptive Fields in Time-Frequency
Author
Abstract
The successful integration of temporal information is crucial for speech recognition: for example, dynamic time warping and more recently hidden Markov models (HMMs) have been critical in speech recognition technology, and both these methods amount to time-aligning feature vectors to stored word and sentence templates. The correlogram [14] relies on the temporal processing of individual outputs of a cochlea-inspired filter bank. Furui [6] posited the existence of perceptual critical points located in time and containing information crucial for the perception of speech, and this idea inspired the statistical speech recognition model SPAM [11], which focusses modeling power on points of transition or maximal spectral change. This suggests that certain time points of a speech signal are in some sense more "important" than others for a recognition task. With HMMs, the locations of the state transitions influence the particular realization of a model and therefore its likelihood; states between transitions are of secondary importance. The SPAM results [1, 12] in fact show that we can group these "nontransitional" states together into a single broad category and still achieve a good recognition error rate.
All the above methods, however, neglect another potential axis of discrimination, namely the cross-spectrotemporal co-information. Non-convex or disjoint patterns with significant temporal and spectral extent cannot be detected directly. Instead, existing methods focus either across frequency on a small slice of time (e.g., LPC cepstral features of a single speech frame) or across time on a small slice of frequency (e.g., the correlogram).
There are in fact several results suggesting that the utilization of cross-spectrotemporal co-information can have a beneficial effect on speech processing and recognition. In the artificial neural network (ANN) speech community [10] it has been shown that using multi-frame context windows can improve recognition scores.
The loss of independent feature vectors notwithstanding, ANNs with such "wide" context windows have the potential to learn time- and frequency-like patterns, depending on the features used. In [2], it was shown that a cross-channel correlation algorithm can be used to find formants in voiced speech in high-noise situations. In [5], it was shown that cross-channel correlation can be used to identify individual sound sources in a mixed auditory scene. Also, in [8], it was suggested that long-term cross-channel correlation could be used as a measure of speech quality.
It can also be argued that the use of cross-spectrotemporal information is biologically plausible. Echoic memory is a temporary buffer in the auditory system that holds pre-attentive information for a brief period of time before subsequent, more detailed, and more taxing processing takes place [13]. It is likely that this storage occurs at the post-cochlear level, as we have no evidence for such memory before or during cochlear processing. Therefore, the echoic storage can plausibly be thought of as a form of processed spectro-temporal buffer. Thus assumed, it would be surprising if subsequent processing did not attempt to find patterns utilizing not just the temporal or spectral axes alone, but shaped regions spanning both time and frequency. Therefore, it may be postulated that the auditory system has the capability, over a 200 ms time span comparable to the echoic store, to observe the co-occurrence of information in different spectro-temporal regions. Similar to the cells in the visual system that respond to particular shapes, one may consider receptive fields over a form of post-cochlear spectro-temporal plane. Later stages of the auditory system could derive arbitrarily shaped regions that perhaps dynamically scale, shift, and transform according to a variety of control mechanisms.
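To make the notion of co-occurrence across different spectro-temporal regions concrete, the following is a minimal sketch of a lagged cross-channel correlation. It is an illustration only and is not taken from the cited works or from the representation developed in this paper. The function name `lagged_channel_correlation`, the `(n_channels, n_frames)` layout of the filter-bank output, and the synthetic test signal are all assumptions made for the example.

```python
import numpy as np


def lagged_channel_correlation(spec, ch_a, ch_b, max_lag):
    """Correlate channel ch_a at frame t with channel ch_b at frame t + lag.

    spec is assumed (hypothetically) to hold filter-bank energies with
    shape (n_channels, n_frames). Returns one Pearson correlation
    coefficient for each lag in 0..max_lag.
    """
    spec = np.asarray(spec, dtype=float)
    n_frames = spec.shape[1]
    if not 0 <= max_lag < n_frames - 1:
        raise ValueError("max_lag must leave at least two overlapping frames")

    corr = np.zeros(max_lag + 1)
    for lag in range(max_lag + 1):
        # Overlapping segments: channel A early, channel B `lag` frames later.
        a = spec[ch_a, : n_frames - lag]
        b = spec[ch_b, lag:]
        a = a - a.mean()
        b = b - b.mean()
        denom = np.sqrt((a @ a) * (b @ b))
        # A constant segment has zero variance; report zero correlation.
        corr[lag] = (a @ b) / denom if denom > 0 else 0.0
    return corr


if __name__ == "__main__":
    # Synthetic two-channel "spectrogram": channel 1 repeats channel 0
    # three frames later, so the correlation should peak at lag 3.
    rng = np.random.default_rng(0)
    x = rng.standard_normal(200)
    spec = np.stack([x, np.roll(x, 3)])
    corr = lagged_channel_correlation(spec, ch_a=0, ch_b=1, max_lag=10)
    print("best lag:", int(np.argmax(corr)))
```

If one assumes a frame step of 10 ms, a `max_lag` of 20 frames would cover the 200 ms span mentioned above; the frame step is likewise an assumption and not a value given in the text.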
In this paper, we consider a new representation of speech that attempts to explicitly represent non-convex spectrotemporal co-information. Section 2 discusses the computational aspects of our representation. Section 3 illustrates with an example. Finally, Section 4 discusses current and future work.